This project was in part inspired by a Veritasium video detailing how a video can go viral or, more generally, reach a certain number of views. Framed as a neural network problem, the task is to give the network certain inputs and have it predict the number of views, or a function of the number of views. In other words, in a primitive sense, I am trying to predict roughly how many views a video will get using a feedforward neural network. The data is a set of features from a set of YouTube videos. The total data has shape (Videos x Features): about 63,938 videos were collected with 12 features each, giving a (63,938 x 12) matrix.
However, keep in mind that this is not how the input data is shaped when it is fed into the neural network. The specific input shapes will be discussed when showcasing the classification code.
My hypothesis on what makes a video viral is that there are four or so main components to it:
Given the limitations imposed by this project, I had to make do with only some aspects of 3 and 5. Part of the reason I chose to focus on data collection for this project is that I would like to build on the work done here and perform inference on other parts of a video, such as the thumbnail and the title, as we learn about other types of neural networks such as CNNs.
The data consisted of the following features:
$$\begin{bmatrix} game \ year\\ game \ name\\ likes\\ views\\ dislikes\\ \frac{likes}{dislikes}\\ \frac{views}{subscribers}\\ \frac{likes}{subscribers}\\ uploadDate\\ channel \end{bmatrix}$$for a selection of about 10,000 videos (or fewer if the channel had less than that or if something impeded the data collection) from each channel. The collection process made use of the YouTube Data API as well as the ELK stack (ElasticSearch, LogStash, and Kibana) in order to be able to effectively search through the collected data. The ELK stack was hosted on AWS (Amazon Web Services) in order to simplify the setup process.
Data Collection took three parts:
The parts will now be explained in more detail.
I decided to gather some YouTube channels, big and small, in order to represent the different aspects of the gaming community. I specifically chose to focus on individuals who run their own channels and primarily showcase themselves playing games. This is as opposed to something like IGN, which, although it features gaming, I feel draws views primarily from people wanting news about new releases rather than anything else.
Among the YouTube gaming channel "types" that I tried to emphasize when collecting channel IDs were:
Gathering a list of video IDs required the use of the YouTube Data API, which imposes a usage quota on its developers. Specifically, one is allotted 10,000 units per day, and different operations cost different numbers of units. The table below shows the cost associated with each operation (YouTube Data API Quota Calculator).
Quota costs:

| resource | method | cost |
|---|---|---|
| activities | list | 1 |
| captions | list | 50 |
| captions | insert | 400 |
| captions | update | 450 |
| captions | delete | 50 |
| channelBanners | insert | 50 |
| channels | list | 1 |
| channels | update | 50 |
| channelSections | list | 1 |
| channelSections | insert | 50 |
| channelSections | update | 50 |
| channelSections | delete | 50 |
| comments | list | 1 |
| comments | insert | 50 |
| comments | update | 50 |
| comments | markAsSpam | 50 |
| comments | setModerationStatus | 50 |
| comments | delete | 50 |
| commentThreads | list | 1 |
| commentThreads | insert | 50 |
| commentThreads | update | 50 |
| guideCategories | list | 1 |
| i18nLanguages | list | 1 |
| i18nRegions | list | 1 |
| members | list | 1 |
| membershipsLevels | list | 1 |
| playlistItems | list | 1 |
| playlistItems | insert | 50 |
| playlistItems | update | 50 |
| playlistItems | delete | 50 |
| playlists | list | 1 |
| playlists | insert | 50 |
| playlists | update | 50 |
| playlists | delete | 50 |
| search | list | 100 |
| subscriptions | list | 1 |
| subscriptions | insert | 50 |
| subscriptions | delete | 50 |
| thumbnails | set | 50 |
| videoAbuseReportReasons | list | 1 |
| videoCategories | list | 1 |
| videos | list | 1 |
| videos | insert | 1600 |
| videos | update | 50 |
| videos | rate | 50 |
| videos | getRating | 1 |
| videos | reportAbuse | 50 |
| videos | delete | 50 |
| watermarks | set | 50 |
| watermarks | unset | 50 |
One key detail, painfully discovered while trying to collect data, was that searching (using the search endpoint and then matching on the channel ID) is very expensive (100 units per call out of the 10,000-unit daily quota), and I kept running out of units. But how would one list the videos of a channel if not by searching for them? The answer is through playlists, for which listing is much cheaper. Additionally, one strange trick I found was that a default playlist of user uploads (obtained by replacing the second character of the channel ID, the 'C', with a 'U') can be used to cheaply list all of a channel's uploads, as opposed to relying on whatever playlists the channel owner has made (usually none, or a much more limited scope than the entire set of uploads).
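To make the cost difference concrete: with search.list at 100 units per call and at most 50 results per page, pulling 10,000 video IDs would cost roughly 20,000 units, double the daily quota, whereas playlistItems.list at 1 unit per page costs roughly 200 units for the same amount. Below is a minimal sketch of the uploads-playlist trick, assuming an authorized `youtube` client like the one built in the collection script that follows; it is an illustration, not the exact code I ran.

def uploads_playlist_id(channel_id):
    # Channel IDs look like "UC...": swapping the leading "UC" for "UU" gives the ID
    # of the channel's auto-generated "uploads" playlist. Slicing is safer than
    # str.replace, which would also replace any later capital "C" in the ID.
    return "UU" + channel_id[2:]

def list_some_upload_ids(youtube, channel_id, page_token=None):
    # One playlistItems.list call (1 quota unit) returns up to 50 video IDs.
    params = dict(part="contentDetails",
                  playlistId=uploads_playlist_id(channel_id),
                  maxResults=50)
    if page_token:
        params["pageToken"] = page_token
    response = youtube.playlistItems().list(**params).execute()
    ids = [item["contentDetails"]["videoId"] for item in response["items"]]
    return ids, response.get("nextPageToken")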
The following is the code used to collect the video IDs from the selected list of channel IDs. Please note that it is not perfect, as some video IDs failed, which is why I added a try/except block. Additionally, there is no real fear of duplicates, since later on, when processing the data before it goes into the NN, the video IDs are deduplicated.
import os
from google_auth_oauthlib.flow import InstalledAppFlow
import pickle
from requests import api
from apiclient.discovery import build
from tqdm import tqdm
import requests
import json
import re
import time
Api_key = open("apikey.txt", 'r')
Api_key = str(Api_key.read())
# This man saved my life:
# https://www.youtube.com/watch?v=th5_9woFJmk
# Bless him
# Corey Schafer
if os.path.exists('token.pickle'):
print('Loading Credentials From File...')
with open('token.pickle', 'rb') as token:
credentials = pickle.load(token)
print(credentials.to_json())
else:
credentials = ""
# Google's Request
from google.auth.transport.requests import Request
# MAKE SURE REDIRECT URI IN GOOGLE CLOUD SERVICE CONSOLE IS http://localhost:8080/ WITH THE SLASH AFTERWARDS
# If there are no valid credentials available, then either refresh the token or log in.
if not credentials or not credentials.valid:
if credentials and credentials.expired and credentials.refresh_token:
print('Refreshing Access Token...')
credentials.refresh(Request())
else:
print('Fetching New Tokens...')
flow = InstalledAppFlow.from_client_secrets_file(
'client_secret_3.json',
scopes=[
'https://www.googleapis.com/auth/youtube.readonly'
]
)
flow.run_local_server(port=8080, prompt='consent',
authorization_prompt_message='')
credentials = flow.credentials
# Save the credentials for the next run
with open('token.pickle', 'wb') as f:
print('Saving Credentials for Future Use...')
print(credentials.to_json())
pickle.dump(credentials.to_json(), f)
class channelDownloader:
def __init__(self, *,api_key, channelID, targetAmount):
self.api_key = api_key
self.channelID = channelID
self.targetAmount = targetAmount
self.counter = 0
self.total_requests = 0
url = "https://www.googleapis.com/youtube/v3/channels?part=snippet%2CcontentDetails%2Cstatistics%2CbrandingSettings&id=" + str(self.channelID) + "&key=" + str(self.api_key)
resp = requests.get(url)
data = resp.content
data = json.loads(data)
        # Windows Version
# w = open(".\\channels.txt","a")
w = open("./channels.txt","a")
w.write( str(self.channelID) + "\n")
w.close()
# Windows Version
# os.mkdir(".\\channels\\" + str(self.channelID))
os.mkdir("./channels/" + str(self.channelID))
# Get subscriber Count, videoCount
# Not really necessary just yet
self.subscriberCount = int(data["items"][0]["statistics"]["subscriberCount"])
# self.moderateComments = int(data["items"][0]["brandingSettings"]["moderateComments"])
# self.country = int(data["items"][0]["country"])
self.videoCount = int(data["items"][0]["statistics"]["videoCount"])
# print("Number of videos is ", self.videoCount)
self.nextPageToken = ""
def iterateOverChannelPlaylists(self):
# print("CHANNEL: ", self.channelID, " #######################################")
videos = []
        videos_request = youtube.playlistItems().list(part="status,id,snippet,contentDetails", playlistId="UU" + self.channelID[2:], maxResults=50)
response_videos = videos_request.execute()
# Windows Version
# w = open(".\\channels\\" + str(self.channelID) + "\\videos.txt","a")
w = open("./channels/" + str(self.channelID) + "/videos.txt","a")
for i in tqdm(range(self.videoCount//50 + 1)):
if(i != 0):
                # This is where the Uploads playlist of the channel is found:
                # playlistId="UU" + self.channelID[2:]
                # It makes some sense given that most channel IDs look like this:
                # UCyMy3i-BaVOmOwTZskm52Ew
                # where the UC probably stands for User Content; swapping it for UU
                # gives the auto-generated User Uploads playlist.
                videos_request = youtube.playlistItems().list(part="status,id,snippet,contentDetails", playlistId="UU" + self.channelID[2:],pageToken=self.nextPageToken, maxResults=50)
response_videos = videos_request.execute()
# Used for iterating over page tokens (50 items per page at max)
self.nextPageToken=response_videos["nextPageToken"]
for video_count, video in enumerate(response_videos["items"]):
# Append video ID into file under /channels/channelID/videos.txt
w.write(str(video["contentDetails"]["videoId"]) + "\n")
# If the channel has less than the target amount or if the number of collected videos surpasses
# the target amount, stop.
if((i)*50 + video_count >= self.targetAmount - 1 or (i)*50 + video_count >= self.videoCount):
w.close()
return videos
w.close()
youtube = build("youtube", "v3", credentials = credentials)
# Here are the different lists I collected. Many of these channels fail out when collecting the uploads
# I don't think the UU trick mentioned above works for all channels, which is why. The final list was actually much smaller,
# at about 58 channels, and not all of the videos belonging to those channels were collected.
# New list. I added some videos from other games such as smash bros, etc...
# now at 112250 videos.
channels = [
"UCam8T03EOFBsNdR0thrFHdQ",
"UC-cnbLlplnXA4oc7rR21qzg",
"UCL3r1JcBQM1cmyMhMFm6XkQ",
"UCcZUc0Wbt4EPVktbB8FrugQ",
"UCoIXnB865l9Ex9zs4OIXTdQ",
"UCvAR_BDFclXJ1Q9ndUhWLCA",
"UCIIPl-DSCC0prKxGGnJrdGQ",
"UCuJyaxv7V-HK4_qQzNK_BXQ",
"UCRD2CerUvgKHQ-wXWXGPJ-w",
"UCCO9_1dqBcBOXa5nB83--pg",
"UCj1J3QuIftjOq9iv_rr7Egw",
"UCVFWJkN7L45x8gZTMXu2UWw",
"UCWzLmNWhgeh3h1j-M-Isy0g",
"UCfgh3Ul_dG6plQ7rzuOLx-w"
]
# new list added spanish, japanese, german, australian, french youtubers
# now at 83621 videos
"""
"UCXGPGV90SPduyn9LVX9s7Kw",
"UCcG-OmRBrHqiwI91ZbhZt2w",
"UC7tHXQXWImq_0I3PCQ8Udaw",
"UCw1SQ6QRRtfAhrN_cjkrOgA",
"UCWZmCMB7mmKWcXJSIPRhzZw",
"UCYVinkwSX7szARULgYpvhLw",
"UCS5Oz6CHmeoF7vSad0qqXfw",
"UCyMy3i-BaVOmOwTZskm52Ew",
"UC9PD3EIAA-vtGLZgYXG3q0A",
"UCEPuItFWOOJ2o5hTu65NlEg",
"UCdKuE7a2QZeHPhDntXVZ91w",
"UC3C3YOGFjn7Pq3lOCeUFHfg",
"UCUdF4kyAKLyT1xYTjddW5_w",
"UCFR2oaNj02WnXkOgLH0iqOA",
"UCPYJR2EIu0_MJaDeSGwkIVw",
"UCkxctb0jr8vwa4Do6c6su0Q",
"UCpqXJOEqGS-TCnazcHCo0rA",
"UCjlWEMQ2jbTy_2nWq8sEliw",
"UCK0_slr6cUzYFL_dsiuW7Xw",
"UCPbGiUt4Yu8EsaFM-KAmgjg",
"UCKQEfXR_uflA4vDtJHmEIIQ",
"UC5v2QgY2D5tlu8uws23MG4Q",
"UCqg3BHb31w2h77vFuTFmp2w",
"UCH-_hzb2ILSCo9ftVSnrCIQ"
"""
# 50 channel list, contains big youtubers such as PewDiePie, Markiplier, I think JackSepticEye, SSundae, some LoL dudes, etc...
# GOT ME ABOUT 49422 videos by doing
# for i in tqdm(channels):
# try:
# downloader = channelDownloader(api_key=Api_key,channelID=i, targetAmount=10000)
# downloader.iterateOverChannelPlaylists()
# except Exception as e:
# print(e)
"""
[
"UC_wB4WC7FTdXcjPAqVlS7mA"
"UCSoTXYNzSD9f6fF2IvRVHdA",
"UCQSEAbOs6vsJfy7WN7iYaGQ",
"UCI3DTtB-a3fJPjKtQ5kYHfA",
"UC8aG3LDTDwNR1UQhSn9uVrw",
"UCHdMK5Ef2El8KbD3L_WgANg",
"UCn4BNPzJDyxkoBgXRuOFeyA",
"UCIPPMRA040LQr5QPyJEbmXA",
"UCEuN0IauvXNTn6KhP9AVJVw",
"UCzYfz8uibvnB7Yc1LjePi4g",
"UCiSVf-UpLC9rRjAT1qRTW0g",
"UC_cvTMeip9po2hZdF3aBXrA",
"UCke6I9N4KfC968-yRcd5YRg",
"UCJZam2u1G0syq3kyqrCXrNw",
"UCsC7Bac1Jpg28k0BSMEz6ZA",
"UC1bwliGvJogr7cWK0nT2Eag",
"UC7_YxT-KID8kRbqZo7MyscQ",
"UC-lHJZR3Gqxm24_Vd_AJ5Yw",
"UCYzPXprvl5Y-Sf0g4vX-m6g",
"UCNAz5Ut1Swwg6h6ysBtWFog",
"UGGiMCgtLmwNSEUcIspB3tnA",
"UCHIEZ3ZK3KWrSJQhsiFq1DA",
"UCuK2rjMid48EA6XoQPe4PIA",
"UCk8-Skcx2jfBsOtHgmzKKhg",
"UCKhRm8klDao9Nq0aBC0KmBw",
"UCP_1oRZ0RjNwgYmmW-CZieg",
"UC0VVYtw21rg2cokUystu2Dw",
"UCZ9lCUhUOUrwqVJmfBkN92A",
"UCcAbV5NEHxbhdVKHirkoObA",
"UCBXN0W_SEgbcLWjB5ycK5Ug",
"UC_AjfAO4mQWl9215YpVZEog",
"UC-hSZv6tzHEUqrUB6mgXvIA",
"UC-SpacLBhJEHbedPuU5QI3w",
"UC_oCw5PLyGQEJvRtlkjwS6A",
"UCYQ2ZdAFYep5-t91WFEdLog",
"UCPgj-dhrdajrJDP2WQjgPvQ",
"UCsM3qWjsbJr5Kt7_Q1ulLjg",
"UC-9XXmqSEMOBU6bv2ESl7CA",
"UCzCMYNVFeF8wluyfW5_4V8w",
"UC5QgPBgPMM6gSV7hQcHgOhg",
"UC7LEEvPCzPAtHLbQeYuJo6w",
"UCO1YGhO4dFcVxzYBU1oTVdQ",
"UCT7VEYOLH5CxNPZJr54Xp9g",
"UCMwQLdr2C6GTSmPgQEy-lSg",
"UCj0sKQk9qjiIfNH22ZBWpxw",
"UCvzoHz_9Vym9HUJ3lNRfCcw",
"UCxo56gzJQ_fhb6svPqTSewg",
"UCmSfdtegCsyBe8qCnRu-ijw",
"UC279mtSpGdNDRVE_NqSPofA",
"UC6Fr4tBYuLHgPkL2tEkTrSA",
"UCRGfxiszv5dhqRLA-OpD6nQ"]
"""
# Goes through channels, and runs the downloader. The target amount is set to 10k videos, but if the channel has less than that, it stops
for i in tqdm(channels):
try:
downloader = channelDownloader(api_key=Api_key,channelID=i, targetAmount=10000)
downloader.iterateOverChannelPlaylists()
except Exception as e:
print(e)
import os
from google_auth_oauthlib.flow import InstalledAppFlow
import pickle
from datetime import datetime
from requests import api
from apiclient.discovery import build
from tqdm import tqdm
import requests
import json
import re
import time
import requests
Api_key = open("apikey.txt", 'r')
Api_key = str(Api_key.read())
if os.path.exists('token.pickle'):
print('Loading Credentials From File...')
with open('token.pickle', 'rb') as token:
credentials = pickle.load(token)
print(credentials.to_json())
else:
credentials = ""
# This man saved my life:
# https://www.youtube.com/watch?v=th5_9woFJmk
# Bless him
# Corey Schafer
from google.auth.transport.requests import Request
# MAKE SURE REDIRECT IN GOOGLE CLOUD SERVICES CONSOLE IS http://localhost:8080/ WITH THE SLASH AFTERWARDS
# If there are no valid credentials available, then either refresh the token or log in.
if not credentials or not credentials.valid:
if credentials and credentials.expired and credentials.refresh_token:
print('Refreshing Access Token...\n\n\n')
credentials.refresh(Request())
else:
print('Fetching New Tokens...\n\n\n')
flow = InstalledAppFlow.from_client_secrets_file(
'client_secret_3.json',
scopes=[
'https://www.googleapis.com/auth/youtube.readonly'
]
)
flow.run_local_server(port=8080, prompt='consent',
authorization_prompt_message='')
credentials = flow.credentials
# Save the credentials for the next run
with open('token.pickle', 'wb') as f:
print('Saving Credentials for Future Use...')
print(credentials.to_json())
pickle.dump(credentials.to_json(), f)
class channelUploader:
def __init__(self, *,api_key, channelID):
self.api_key = api_key
self.channelID = channelID
self.counter = 0
self.total_requests = 0
url = "https://www.googleapis.com/youtube/v3/channels?part=snippet%2Cstatus%2CcontentDetails%2Cstatistics%2CbrandingSettings&id=" + str(self.channelID) + "&key=" + str(self.api_key)
resp = requests.get(url)
# print(url)
data = resp.content
data = json.loads(data)
        # Windows Version
# w = open(".\\channels.txt","a")
w = open("./channels.txt","a")
w.write( str(self.channelID) + "\n")
w.close()
# Windows Version
# os.mkdir(".\\channels\\" + str(self.channelID))
# os.mkdir("./channels/" + str(self.channelID))
self.subscriberCount = int(data["items"][0]["statistics"]["subscriberCount"])
moderateComments = False
madeForKids = False
country = ""
if("moderateComments" in data["items"][0]["brandingSettings"]["channel"]):
moderateComments = bool(data["items"][0]["brandingSettings"]["channel"]["moderateComments"])
if("madeForKids" in data["items"][0]["status"]):
madeForKids = bool(data["items"][0]["status"]["madeForKids"])
if("country" in data["items"][0]["brandingSettings"]["channel"]):
country = str(data["items"][0]["brandingSettings"]["channel"]["country"])
output = {
"country" : country,
"videoCount": int(data["items"][0]["statistics"]["videoCount"]),
"totalViewCount": int(data["items"][0]["statistics"]["viewCount"]),
"subscriberCount" : int(data["items"][0]["statistics"]["subscriberCount"]),
"moderateComments" :moderateComments,
"madeForKids" : madeForKids,
"title" : str(data["items"][0]["brandingSettings"]["channel"]["title"]),
"description" : str(data["items"][0]["snippet"]["description"]),
"thumbnail" : str(data["items"][0]["snippet"]["thumbnails"]["default"]["url"]),
"channelInception" : str(data["items"][0]["snippet"]["publishedAt"])
}
headers = {
"Authorization" : "Basic ZG5pZXZlc2FjYXJvMjAxODpGbG9yaWRhX1RlY2hfQGNjZXNzX0RlbkFfMjM3NDMzNjI2NzczNTc=",
'Content-Type': 'application/json'
}
response = requests.post("https://search-mth5320-project1-2-2t7qhhcg23lmtwmbp6m4pwlcya.us-east-2.es.amazonaws.com/channels/_doc", data=json.dumps(output),headers=headers)
# print("\n\n")
# print(response.status_code)
# print(response.content)
# print("Number of videos is ", self.videoCount)
# self.nextPageToken = ""
# print(output)
# return url
def getVidInfo(self, *,videoID, uploader, document_id):
url = "???"
try:
url = "https://www.youtube.com/watch?v=" + str(videoID)
# print(url)
resp = requests.get(url)
except Exception as e:
print(url)
print(e)
# print(resp.content)
# game_title = re.findall(r'/(<div id="title" class="style-scope ytd-rich-metadata-renderer">)[A-z]*</div>/g', str(resp.content))
# file = open("test.txt" , "r")
# t = file.read()
# w = open("curloutput.txt", 'w')
# w.write(str(resp.content))
upload_date_indx = str(resp.content).find("<meta itemprop=\"uploadDate\" content=")
upload_date = str(resp.content)[upload_date_indx + len("<meta itemprop=\"uploadDate\" content=") + 1 :upload_date_indx + len("<meta itemprop=\"uploadDate\" content=") + 12 -1]
        upload_date = upload_date.replace("\"", "")
# print(upload_date)
time = datetime(year=int(upload_date[:4]),month=int(upload_date[5:7]),day=int(upload_date[8:]))
# print("Month is ", time.month)
# print("Year is ", time.year)
# print("Day is ", time.day)
time = time.strftime("%Y-%m-%dT%H:%M:%SZ")
upload_date = time
game_name_indx = str(resp.content).find("}]},\"title\":{\"simpleText\":")
# game_title = re.findall('}]},"title":{"simpleText":', str(resp.content))
# print(game_name_indx)
game_name = str(resp.content)[game_name_indx:game_name_indx+100].replace("}]},\"title\":{\"simpleText\":\"", "")
# print(game_name)
game_year = game_name[game_name.find("\"},\"subtitle\":{\"simpleText\":\"") + len("\"},\"subtitle\":{\"simpleText\":\""):].replace("\"},\"callToAction\":{\"","")[:4]
# in case the game year is legit so that I can turn it into an int without having an exception
temp = game_name
game_name = game_name[:game_name.find("},\"subtitle\":{")-1]
if(game_name.find(":{\"")!=-1 or not (game_year.isdigit())):
game_name="N/A"
game_year="N/A"
else:
game_year = int(temp[temp.find("\"},\"subtitle\":{\"simpleText\":\"") + len("\"},\"subtitle\":{\"simpleText\":\""):].replace("\"},\"callToAction\":{\"","")[:4])
likes_indx = str(resp.content).find("{\"iconType\":\"LIKE\"},\"defaultText\":{\"accessibility\":{\"accessibilityData\":{\"label\":\"") + len("{\"iconType\":\"LIKE\"},\"defaultText\":{\"accessibility\":{\"accessibilityData\":{\"label\":\"")
likes = str(resp.content)[likes_indx:likes_indx + 50]
likes = str(likes[:likes.find("likes")])
likes = likes.strip()
likes= likes.replace(",", "")
likes = int(likes)
# print(url)
dislike_find_str = "{\"iconType\":\"DISLIKE\"},\"defaultText\":{\"accessibility\":{\"accessibilityData\":{\"label\":\""
dislikes_indx = str(resp.content).find(dislike_find_str) + len(dislike_find_str)
dislikes = str(resp.content)[dislikes_indx:dislikes_indx + 50]
# print("Dislikes: ", dislikes)
# dislikes = str(dislikes[dislikes.find("simpleText\":\"") + len("simpleText\":\""):])
dislikes = str(dislikes[:dislikes.find("\"")])
# dislikes = dislikes[:dislikes.find("\"}")]
#
# print("Dislikes: ", dislikes)
dislikes = dislikes.strip()
# print("Dislikes: ", dislikes)
dislikes= dislikes.replace(" dislikes", "")
dislikes= dislikes.replace(" dislike", "")
dislikes= dislikes.replace(",", "")
# print("Dislikes: ", dislikes)
if(dislikes == "No"):
dislikes = 0
else:
dislikes = int(dislikes)
viewcount_indx = str(resp.content).find("\"viewCount\":{\"videoViewCountRenderer\":{\"viewCount\":{\"simpleText\":\"") + len("\"viewCount\":{\"videoViewCountRenderer\":{\"viewCount\":{\"simpleText\":\"")
viewcount = str(resp.content)[viewcount_indx:viewcount_indx + 50]
# print("ViewCount", viewcount)
viewcount = str(viewcount[:viewcount.find(" views")])
# print("ViewCount", viewcount)
viewcount = viewcount.strip()
# print("ViewCount", viewcount)
viewcount= viewcount.replace(",", "")
# print("ViewCount", viewcount)
# print(url)
if(viewcount == "No"):
viewcount = 0
else:
viewcount = int(viewcount)
# print("Game Year of Pub: ", game_year)
# print("Game Name: ", game_name)
# print("Likes: ", likes)
# print("ViewCount", viewcount)
# print("Dislikes: ", dislikes)
# print("ViewCount / SubscriberCount: ", viewcount / subscriberCount)
# print("SubscriberCount: ",subscriberCount)
# print("Upload Date: ", upload_date)
# print(e)
# comment_count_indx = str(resp.content).find() + len()
# comment_count = str(resp.content)[comment_count_indx:comment_count_indx + 50]
# comment_count = str(comment_count[:comment_count.find(" comment_s")])
# comment_count = comment_count.strip()
# comment_count= comment_count.replace(",", "")
# comment_count = int(comment_count)
# print(url)
# print("Game Year of Pub: ", game_year)
# print("Game Name: ", game_name)
# print("Likes: ", likes)
# print("ViewCount", viewcount)
# print("Dislikes: ", dislikes)
# print("ViewCount / SubscriberCount: ", viewcount / subscriberCount)
# print("SubscriberCount: ",subscriberCount)
# print("Upload Date: ", upload_date)
l_div_d = 0
l_div_s = 0
v_div_s = 0
if(dislikes == 0):
l_div_d = -99
else:
l_div_d = likes/dislikes
if(self.subscriberCount == 0):
l_div_s = -99
v_div_s = -99
else:
l_div_s = likes/self.subscriberCount
v_div_s = viewcount/self.subscriberCount
output = {
"id" : videoID,
"game_year": game_year,
"game_name": game_name,
"likes": likes,
"views": viewcount,
"dislikes": dislikes,
"l/d": l_div_d,
"v/s" : v_div_s,
"l/s" : l_div_s,
"uploadDate" : upload_date,
"channel" : self.channelID,
"uploader" : uploader
}
headers = {
"Authorization" : "Basic ZG5pZXZlc2FjYXJvMjAxODpGbG9yaWRhX1RlY2hfQGNjZXNzX0RlbkFfMjM3NDMzNjI2NzczNTc=",
'Content-Type': 'application/json'
}
response = requests.post("https://search-mth5320-project1-2-2t7qhhcg23lmtwmbp6m4pwlcya.us-east-2.es.amazonaws.com/" + document_id + "/_doc", data=json.dumps(output),headers=headers)
# channel separation for distributing the work
list_1 = [
"UC0VVYtw21rg2cokUystu2Dw",
"UC1bwliGvJogr7cWK0nT2Eag",
"UC5QgPBgPMM6gSV7hQcHgOhg",
"UC5v2QgY2D5tlu8uws23MG4Q",
"UC6Fr4tBYuLHgPkL2tEkTrSA",
"UC7_YxT-KID8kRbqZo7MyscQ",
"UC8aG3LDTDwNR1UQhSn9uVrw",
"UC9PD3EIAA-vtGLZgYXG3q0A",
"UCam8T03EOFBsNdR0thrFHdQ"
]
list_2 = [
"UCBXN0W_SEgbcLWjB5ycK5Ug",
"UCcAbV5NEHxbhdVKHirkoObA",
"UCcG-OmRBrHqiwI91ZbhZt2w",
"UC-cnbLlplnXA4oc7rR21qzg",
"UC_cvTMeip9po2hZdF3aBXrA",
"UCcZUc0Wbt4EPVktbB8FrugQ",
"UCdKuE7a2QZeHPhDntXVZ91w",
"UCEPuItFWOOJ2o5hTu65NlEg",
"UCEuN0IauvXNTn6KhP9AVJVw"
]
list_3 = [
"UCfgh3Ul_dG6plQ7rzuOLx-w",
"UCFR2oaNj02WnXkOgLH0iqOA",
"UCHdMK5Ef2El8KbD3L_WgANg",
"UC-hSZv6tzHEUqrUB6mgXvIA",
"UCI3DTtB-a3fJPjKtQ5kYHfA",
"UCIPPMRA040LQr5QPyJEbmXA",
"UCj0sKQk9qjiIfNH22ZBWpxw",
"UCj1J3QuIftjOq9iv_rr7Egw",
"UCjlWEMQ2jbTy_2nWq8sEliw"
]
list_6 = [
"UCK0_slr6cUzYFL_dsiuW7Xw",
"UCk8-Skcx2jfBsOtHgmzKKhg",
"UCKQEfXR_uflA4vDtJHmEIIQ",
"UCkxctb0jr8vwa4Do6c6su0Q",
"UCL3r1JcBQM1cmyMhMFm6XkQ",
"UC-lHJZR3Gqxm24_Vd_AJ5Yw",
"UCn4BNPzJDyxkoBgXRuOFeyA",
"UCNAz5Ut1Swwg6h6ysBtWFog",
"UCO1YGhO4dFcVxzYBU1oTVdQ"
]
list_5 = [
"UCoIXnB865l9Ex9zs4OIXTdQ",
"UCPbGiUt4Yu8EsaFM-KAmgjg",
"UCPgj-dhrdajrJDP2WQjgPvQ",
"UCPYJR2EIu0_MJaDeSGwkIVw",
"UCqg3BHb31w2h77vFuTFmp2w",
"UCQSEAbOs6vsJfy7WN7iYaGQ",
"UCRGfxiszv5dhqRLA-OpD6nQ",
"UCsM3qWjsbJr5Kt7_Q1ulLjg",
"UC-SpacLBhJEHbedPuU5QI3w"
]
list_7 = [
"UCUdF4kyAKLyT1xYTjddW5_w",
"UCuJyaxv7V-HK4_qQzNK_BXQ",
"UCuK2rjMid48EA6XoQPe4PIA",
"UCVFWJkN7L45x8gZTMXu2UWw",
"UCw1SQ6QRRtfAhrN_cjkrOgA",
"UCWzLmNWhgeh3h1j-M-Isy0g",
"UCXGPGV90SPduyn9LVX9s7Kw",
"UCxo56gzJQ_fhb6svPqTSewg",
"UCyMy3i-BaVOmOwTZskm52Ew"
]
list_8 = [
"UCYQ2ZdAFYep5-t91WFEdLog",
"UCYVinkwSX7szARULgYpvhLw",
"UCYzPXprvl5Y-Sf0g4vX-m6g",
"UCzYfz8uibvnB7Yc1LjePi4g"
]
# Windows Version:
directory = r'C:\\Users\SplashFreeze\Desktop\Semester 7\\MTH 4312\\Project1\\channels'
# directory = '/home/davidn/Documents/MTH-5320/Project1/channels'
failed_channels = []
parallel = []
# Change list_1
# for each computer you want to distribute the work to
for filename in tqdm(list_2):
# print(os.path.join(directory, filename))
# try:
# Windows Version
o = open(os.path.join(directory, filename) + "\\videos.txt","r")
# o = open(os.path.join(directory, filename) + "/videos.txt","r")
# print(filename)
u = channelUploader(api_key=Api_key, channelID=filename)
lines = o.readlines()
# print(lines)
# lines = [lines.rstrip() for line in lines]
# # with open()
# Set this equal to a username or identifier for computer
# in case your computer screws up and you have to restart the upload process
uploader = "HP-laptop"
for count, line in tqdm(enumerate(lines)):
u.getVidInfo(videoID = line.replace("\n", ""), uploader=uploader,document_id="videos2")
# except Exception as e:
# print("Channel: " + filename, "could not get videos...")
# print(e)
# failed_channels.append(filename)
print("Failed channels are", failed_channels)
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("images/dashboard.png")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()
The setup of the ELK stack (sometimes referred to simply as ElasticSearch) was done by creating an AWS account and following the required setup instructions. One caveat is that, both as a college student and in the interest of fairness, I only used free-tier limits for this ELK setup, which meant two t3.small instances with 20 GiB of EBS volume in total. The reason I chose ElasticSearch was the powerful visualization it provides in the form of Kibana. Here are a few examples:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("images/top_15.png")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("images/top_50.png")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("images/unique.png")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()
In order to speed up training, I downloaded a copy of the data locally by querying ElasticSearch. One might think that this defeats the point of having a database, but the true value of ElasticSearch lies in its searching capabilities and its visualizations, as well as the fact that all the data sits in one easily accessible cloud-hosted location, so I only have to collect it from YouTube once and be done with it.
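For reference, here is a rough sketch of the kind of query used to dump the index locally, assuming the same videos2 index, endpoint, and Basic-auth header used by the uploader above; the exact query and paging I used are not shown, and a full dump of ~64k documents would need the scroll or search_after API rather than a single request.

import json
import requests
# Illustrative only: pull documents back out of the videos2 index and save the raw hits,
# which is the shape the loader below expects (a list of objects with a "_source" field).
ES_HOST = "https://search-mth5320-project1-2-2t7qhhcg23lmtwmbp6m4pwlcya.us-east-2.es.amazonaws.com"
headers = {
    "Authorization" : "Basic ZG5pZXZlc2FjYXJvMjAxODpGbG9yaWRhX1RlY2hfQGNjZXNzX0RlbkFfMjM3NDMzNjI2NzczNTc=",
    "Content-Type": "application/json"
}
query = {"query": {"match_all": {}}, "size": 10000}
resp = requests.post(ES_HOST + "/videos2/_search", data=json.dumps(query), headers=headers)
hits = resp.json()["hits"]["hits"]
with open("sample_videos.json", "w", encoding="utf-8") as f:
    json.dump(hits, f)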
import json
import os
import math
import pickle
from datetime import datetime
from requests import api
import numpy as np
from tqdm import tqdm
import requests
import json
import re
import time
import requests
import ast
import codecs
j = open("sample_videos.json", "r",encoding="utf-8")
j = j.read()
j = codecs.decode(j, "unicode-escape")
w = json.loads(j)
w = [x["_source"] for x in w]
# print(w[0])
seen = dict()
a = []
for elem in w:
if elem['id'] in seen:
continue
else:
seen[elem['id']] = True
a.append(elem)
print(a[:2])
[{"id": "VE4VbuV2jVU", "game_year": 2017, "game_name": "Rime", "likes": 3961, "views": 57890, "dislikes": 48, "l/d": 82.52083333333333, "v/s": 0.04256617647058823, "uploadDate": "2017-06-06T00:00:00Z", "channel": "UCXGPGV90SPduyn9LVX9s7Kw"}, {"id": "zrtdVAE1EK8", "game_year": 2017, "game_name": "Arms", "likes": 16851, "views": 567993, "dislikes": 321, "l/d": 52.495327102803735, "v/s": 0.4176419117647059, "uploadDate": "2017-06-06T00:00:00Z", "channel": "UCXGPGV90SPduyn9LVX9s7Kw"}]
For inputting one of the most important features, I wrote a function that generates a one-hot encoding of which game is being played in the video. Note that I only used the top 15 categories, including the "N/A" category, which covers off-topic videos that may talk about video games or something game-related but do not necessarily have a video game entry on the video's page, as seen below:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
image = mpimg.imread("images/ex_entry.png")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()
# This list was made with the help of Kibana
top_15 = ["N/A",
"Minecraft",
"Fortnite",
"Grand Theft Auto V",
"Roblox",
"Among Us",
"Happy Wheels",
"Five Nights at Freddys",
"PUBG MOBILE",
"Garrys Mod",
"Tom Clancys Rainbow Six Siege",
"Call of Duty: Black Ops III",
"Subnautica",
"Call of Duty: Advanced Warfare",
"Hello Neighbor"]
def one_hot_games(game, unique_games):
# automatically adds an "other" category to mark for the non-top 15 in our case
a = [0]*(len(unique_games) + 1)
try:
a[unique_games.index(game)] = 1
except Exception as e:
# if it is not in any, simply add a 1 to the "other" category
a[-1] = 1
return a
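As a quick illustration (hypothetical calls using the top_15 list above): a game in the list lights up its own slot, and anything else falls into the trailing "other" slot. Together with the three numeric features described next, this yields the 19-dimensional input vectors seen later.

print(one_hot_games("Minecraft", top_15))
# [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot_games("Terraria", top_15))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]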
As for processing and normalizing the inputs, I started by taking the natural logarithm of several ratios, such as
"l/d" or $\frac{likes}{dislikes}$ and
"v/s" or $\frac{views}{subscribers}$,
as well as a non-ratio quantity, the raw number of subscribers, and then dividing each by its maximum value in order to normalize. I made sure my values were sensible by printing out the entry to which each maximum value belonged. For example, the maximum subscriber count in this whole list belongs to PewDiePie, which makes sense.
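In symbols, each such raw quantity $r$ is mapped (roughly, as implemented below, with small clamps to keep the logarithm defined) to
$$x = \frac{\ln r}{\max_{videos} \ln r},$$
where $r$ is $\frac{likes}{dislikes}$, $\frac{views}{subscribers}$, or the subscriber count recovered as $views \div \frac{views}{subscribers}$.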
l_div_d = [math.log(x["l/d"]) if x["l/d"] >0 else 0.1 for x in a]
max_l_div_d = l_div_d[np.argmax(l_div_d)]
print("Max l_div_d is ","https://youtube.com/watch?v=" +str(a[np.argmax(l_div_d)]["id"]))
print("Belongs to ", a[np.argmax(l_div_d)]["id"])
v_div_s = [math.log(x["v/s"]) if x["v/s"] > 0 else 0.1 for x in a]
max_v_div_s = v_div_s[np.argmax(v_div_s)]
print("Max v_dix_s is ", max_v_div_s)
print("Belongs to ", "https://youtube.com/watch?v=" + str(a[np.argmax(v_div_s)]["id"]))
v_div_s = [math.log(x["v/s"]) for x in a]
max_v_div_s = v_div_s[np.argmax(v_div_s)]
print("Max v_dix_s is ", max_v_div_s)
print("Belongs to ", "https://youtube.com/watch?v=" + str(a[np.argmax(v_div_s)]["id"]))
subscribers = [math.log(max(0.1, x["views"]/x["v/s"])) for x in a]
max_subs = subscribers[np.argmax(subscribers)]
print("Max subscribers is ", max_subs)
print("Belongs to ", "https://youtube.com/watch?v=" + str(a[np.argmax(subscribers)]["id"]))
# np.argmax(])
inputs = [([math.log(max(0.1, x["l/d"]))/max_l_div_d,math.log(max(0,x["v/s"]))/max_v_div_s,math.log(max(0.1,x["views"]/x["v/s"]))/max_subs ] +one_hot_games(x["game_name"], top_15)) for x in a]
print("Ex. First entry of inputs is ")
print(inputs[0])
inputs = np.array(inputs,dtype="float")
# print(inputs[0].shape)
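# The target below is the (clipped) order of magnitude of the view count:
# class 0 is under 10^4 views, 1 is 10^4-10^5, 2 is 10^5-10^6, 3 is 10^6-10^7, and 4 is 10^7 and up
# (to_categorical later truncates the float to an integer class label).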
outputs = [[min(4,max(0, math.log( x["views"], 10)-3))] for x in a]
outputs = np.array(outputs,dtype="float")
# print(outputs.shape)
Max l_div_d is https://youtube.com/watch?v=THbdeQrnHm0
Belongs to THbdeQrnHm0
Max v_dix_s is 3.3187360106384216
Belongs to https://youtube.com/watch?v=bSN27iYjcdc
Max v_dix_s is 3.3187360106384216
Belongs to https://youtube.com/watch?v=bSN27iYjcdc
Max subscribers is 18.51599092375669
Belongs to https://youtube.com/watch?v=YlXFDUL6Qa4
Ex. First entry of inputs is
[0.6819946722401631, -0.9511739742239801, 0.7627458511870363, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Then, as a benchmark, I use a one-layer neural network (essentially multi-class linear regression) to gauge how much the more complicated architecture of a deeper neural network, along with its associated hyperparameters, improves the results.
# load some more libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
import pandas as pd
np.set_printoptions(linewidth=180)
class FeedforwardNeuralNetworkSGD:
def __init__(self, layers, alpha = 0.1, batchSize = 32):
self.W = []
self.layers = layers
self.alpha = alpha
# Change in Activation function!
# C = 2.5
self.c = 2.5
derivatice_elu_single = lambda x: (self.c * np.exp(x)) if(x <=0) else 1
self.vectorized_ELU_Derivatice= np.vectorize(derivatice_elu_single)
elu_single = lambda x: (self.c * np.exp(x) - 1) if(x <=0) else x
self.vectorized_ELU = np.vectorize(elu_single)
self.batchSize = batchSize
for i in np.arange(0, len(layers) - 2):
self.W.append(np.random.randn(layers[i] + 1, layers[i + 1] + 1)/100.0)
self.W.append(np.random.randn(layers[-2] + 1, layers[-1])/100.0)
def sigmoid(self, x):
return 1.0 / (1 + np.exp(-x))
def sigmoidDerivative(self, z):
return z * (1 - z)
def getNextBatch(self, X, y, batchSize):
for i in np.arange(0, X.shape[0], batchSize):
yield (X[i:i + batchSize], y[i:i + batchSize])
def fit(self, X, y, epochs = 10000, update = 1000):
X = np.hstack((X, np.ones([X.shape[0],1])))
for epoch in tqdm(np.arange(0,epochs)):
p = np.arange(0,X.shape[0])
np.random.shuffle(p)
X = X[p]
y = y[p]
for (x, target) in self.getNextBatch(X, y, self.batchSize):
A = [np.atleast_2d(x)]
B = [np.atleast_2d(x)]
for layer in np.arange(0, len(self.W)):
net = A[layer].dot(self.W[layer])
out = self.vectorized_ELU(net)
A.append(out)
B.append(net)
error = A[-1] - target
D = [error * self.vectorized_ELU_Derivatice(B[-1])]
for layer in np.arange(len(A) - 2, 0, -1):
delta = D[-1].dot(self.W[layer].T)
delta = delta * self.vectorized_ELU_Derivatice(B[layer])
D.append(delta)
D = D[::-1]
for layer in np.arange(0, len(self.W)):
self.W[layer] -= self.alpha * A[layer].T.dot(D[layer])
def predict(self, X, addOnes = True):
p = np.atleast_2d(X)
if addOnes:
p = np.hstack((p, np.ones([X.shape[0],1])))
for layer in np.arange(0, len(self.W)):
p = self.vectorized_ELU(np.dot(p, self.W[layer]))
return p
def computeLoss(self, X, y):
y = np.atleast_2d(y)
predictions = self.predict(X, addOnes = False)
loss = np.sum((predictions - y)**2) / 2.0
return loss
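For reference (this is just a restatement of the lambdas hard-coded above, not a change to the method), the activation used throughout this class is an ELU-style function with $c = 2.5$; note that its negative branch is $c\,e^{x} - 1$ rather than the standard ELU's $c\,(e^{x} - 1)$:

$$f(x) = \begin{cases} c\,e^{x} - 1 & x \le 0 \\ x & x > 0 \end{cases} \qquad f'(x) = \begin{cases} c\,e^{x} & x \le 0 \\ 1 & x > 0 \end{cases}$$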
np.random.seed(7468378)
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
# trainX = trainX.astype('float64')/255.0
# testX = testX.astype('float64')/255.0
# trainX = trainX.reshape([60000, 28*28])
# testX = testX.reshape([10000, 28*28])
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],testY.shape[1]], 0.005, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:10<00:00, 9.26it/s]
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
Training set accuracy
precision recall f1-score support
0 0.98 0.55 0.71 206
1 0.72 0.47 0.57 809
2 0.75 0.79 0.77 2070
3 0.75 0.93 0.83 1441
4 0.00 0.00 0.00 94
accuracy 0.75 4620
macro avg 0.64 0.55 0.58 4620
weighted avg 0.74 0.75 0.73 4620
Test set accuracy
precision recall f1-score support
0 0.97 0.54 0.70 212
1 0.69 0.45 0.55 833
2 0.73 0.78 0.76 1999
3 0.75 0.94 0.83 1456
4 0.00 0.00 0.00 119
accuracy 0.74 4619
macro avg 0.63 0.54 0.57 4619
weighted avg 0.72 0.74 0.72 4619
As can be seen, the benchmark already does decently on the dataset, but we will see that adding more layers (thus turning it from linear regression into the realm of neural networks) improves the results.
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],16,16,testY.shape[1]], 0.005, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:58<00:00, 1.70it/s]
Training set accuracy
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
precision recall f1-score support
0 0.82 0.84 0.83 206
1 0.70 0.60 0.65 809
2 0.73 0.89 0.80 2070
3 0.88 0.73 0.80 1441
4 0.00 0.00 0.00 94
accuracy 0.77 4620
macro avg 0.63 0.61 0.62 4620
weighted avg 0.76 0.77 0.76 4620
Test set accuracy
precision recall f1-score support
0 0.83 0.78 0.80 212
1 0.73 0.60 0.66 833
2 0.72 0.90 0.80 1999
3 0.88 0.74 0.80 1456
4 0.00 0.00 0.00 119
accuracy 0.77 4619
macro avg 0.63 0.60 0.61 4619
weighted avg 0.76 0.77 0.75 4619
For now, we will perform some experiments with the learning rate. Perhaps a more aggressive learning rate of $\alpha = 0.1$ will yield better accuracy.
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],16,16,testY.shape[1]], 0.1, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:51<00:00, 1.95it/s]
Training set accuracy
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
precision recall f1-score support
0 0.05 1.00 0.09 218
1 0.00 0.00 0.00 839
2 0.00 0.00 0.00 2060
3 0.00 0.00 0.00 1413
4 0.00 0.00 0.00 90
accuracy 0.05 4620
macro avg 0.01 0.20 0.02 4620
weighted avg 0.00 0.05 0.00 4620
Test set accuracy
precision recall f1-score support
0 0.04 1.00 0.08 200
1 0.00 0.00 0.00 803
2 0.00 0.00 0.00 2009
3 0.00 0.00 0.00 1484
4 0.00 0.00 0.00 123
accuracy 0.04 4619
macro avg 0.01 0.20 0.02 4619
weighted avg 0.00 0.04 0.00 4619
Based on this result, as well as previous results from other assignments, I will continue to use 0.005 or even smaller learning rates. Let me try 0.001.
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],16,16,testY.shape[1]], 0.001, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:34<00:00, 2.90it/s]
Training set accuracy
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
precision recall f1-score support
0 0.73 0.28 0.40 219
1 0.62 0.69 0.65 831
2 0.70 0.74 0.72 2000
3 0.73 0.74 0.74 1475
4 0.00 0.00 0.00 95
accuracy 0.69 4620
macro avg 0.56 0.49 0.50 4620
weighted avg 0.68 0.69 0.68 4620
Test set accuracy
precision recall f1-score support
0 0.75 0.32 0.45 199
1 0.64 0.67 0.65 811
2 0.71 0.74 0.73 2069
3 0.69 0.75 0.72 1422
4 0.00 0.00 0.00 118
accuracy 0.69 4619
macro avg 0.56 0.50 0.51 4619
weighted avg 0.68 0.69 0.68 4619
This one did about the same as $\alpha = 0.005$, but given its somewhat inferior results across the classes, I will proceed with $\alpha = 0.005$. Next, I will try varying the network architecture by doubling the size of the first hidden layer. With any luck, this should predict more accurately.
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.005, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [01:34<00:00, 1.06it/s]
Training set accuracy
precision recall f1-score support
0 0.87 0.94 0.91 218
1 0.78 0.81 0.79 802
2 0.75 0.85 0.80 2025
3 0.86 0.69 0.76 1474
4 0.95 0.60 0.74 101
accuracy 0.79 4620
macro avg 0.84 0.78 0.80 4620
weighted avg 0.80 0.79 0.79 4620
Test set accuracy
precision recall f1-score support
0 0.83 0.92 0.87 200
1 0.80 0.80 0.80 840
2 0.75 0.87 0.81 2044
3 0.87 0.69 0.77 1423
4 0.96 0.66 0.78 112
accuracy 0.80 4619
macro avg 0.84 0.79 0.81 4619
weighted avg 0.81 0.80 0.80 4619
That only did slightly better than the input, 16, 16, output configuration seen above. Perhaps decreasing the batch size can help?
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.005, 4)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [01:23<00:00, 1.19it/s]
Training set accuracy
precision recall f1-score support
0 0.90 0.60 0.72 184
1 0.87 0.54 0.66 858
2 0.68 0.95 0.79 2029
3 0.90 0.64 0.75 1449
4 0.44 0.49 0.46 100
accuracy 0.75 4620
macro avg 0.76 0.64 0.68 4620
weighted avg 0.79 0.75 0.74 4620
Test set accuracy
precision recall f1-score support
0 0.92 0.62 0.74 234
1 0.87 0.60 0.71 784
2 0.69 0.95 0.80 2040
3 0.88 0.60 0.72 1448
4 0.43 0.44 0.44 113
accuracy 0.75 4619
macro avg 0.76 0.64 0.68 4619
weighted avg 0.78 0.75 0.75 4619
Perhaps a happy medium of batch size = 8 can be found?
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.005, 8)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [01:14<00:00, 1.33it/s]
Training set accuracy
precision recall f1-score support
0 0.98 0.79 0.88 212
1 0.71 0.77 0.74 780
2 0.76 0.82 0.79 2036
3 0.81 0.67 0.73 1475
4 0.27 0.44 0.34 117
accuracy 0.75 4620
macro avg 0.71 0.70 0.70 4620
weighted avg 0.77 0.75 0.76 4620
Test set accuracy
precision recall f1-score support
0 0.99 0.84 0.91 206
1 0.76 0.77 0.77 862
2 0.76 0.82 0.79 2033
3 0.81 0.70 0.75 1422
4 0.25 0.41 0.31 96
accuracy 0.77 4619
macro avg 0.71 0.71 0.71 4619
weighted avg 0.78 0.77 0.77 4619
Clearly, the batch size of 16 does better. Perhaps increasing the batch size to 32 can yield better results?
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.005, 32)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [01:24<00:00, 1.19it/s]
Training set accuracy
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
precision recall f1-score support
0 0.04 1.00 0.08 198
1 0.00 0.00 0.00 821
2 0.00 0.00 0.00 2008
3 0.00 0.00 0.00 1483
4 0.00 0.00 0.00 110
accuracy 0.04 4620
macro avg 0.01 0.20 0.02 4620
weighted avg 0.00 0.04 0.00 4620
Test set accuracy
precision recall f1-score support
0 0.05 1.00 0.09 220
1 0.00 0.00 0.00 821
2 0.00 0.00 0.00 2061
3 0.00 0.00 0.00 1414
4 0.00 0.00 0.00 103
accuracy 0.05 4619
macro avg 0.01 0.20 0.02 4619
weighted avg 0.00 0.05 0.00 4619
This probably caused the gradients to explode (the weight update sums the gradient over the whole batch without averaging, so a larger batch means a larger effective step), so the learning rate will be reduced.
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.0005, 32)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19) Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:24<00:00, 4.05it/s]
Training set accuracy
C:\Python39\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior. (warning repeated)
precision recall f1-score support
0 0.00 0.00 0.00 211
1 0.51 0.38 0.43 808
2 0.48 0.78 0.59 2030
3 0.52 0.26 0.35 1464
4 0.00 0.00 0.00 107
accuracy 0.49 4620
macro avg 0.30 0.28 0.28 4620
weighted avg 0.47 0.49 0.45 4620
Test set accuracy
precision recall f1-score support
0 0.00 0.00 0.00 207
1 0.52 0.39 0.45 834
2 0.49 0.79 0.61 2039
3 0.53 0.26 0.35 1433
4 0.00 0.00 0.00 106
accuracy 0.50 4619
macro avg 0.31 0.29 0.28 4619
weighted avg 0.47 0.50 0.46 4619
Using an even lower learning rate to further compensate for the large gradient...
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.00005, 32)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19)
Outputs shape is (4619, 5)
100%|██████████| 100/100 [00:20<00:00, 4.99it/s]
Training set accuracy
precision recall f1-score support
0 0.00 0.00 0.00 207
1 0.00 0.00 0.00 832
2 0.44 1.00 0.61 2039
3 0.00 0.00 0.00 1441
4 0.00 0.00 0.00 101
accuracy 0.44 4620
macro avg 0.09 0.20 0.12 4620
weighted avg 0.19 0.44 0.27 4620
Test set accuracy
precision recall f1-score support
0 0.00 0.00 0.00 211
1 0.00 0.00 0.00 810
2 0.44 1.00 0.61 2030
3 0.00 0.00 0.00 1456
4 0.00 0.00 0.00 112
accuracy 0.44 4619
macro avg 0.09 0.20 0.12 4619
weighted avg 0.19 0.44 0.27 4619
Based on this, and on the diminishing returns, it seems safe to say that a batch size of 16 is the best choice for what is needed here. Next, I will try using the sigmoid activation (and its derivative) instead of ReLU to see whether better results can be achieved.
class FeedforwardNeuralNetworkSGD:
def __init__(self, layers, alpha = 0.1, batchSize = 32):
self.W = []
self.layers = layers
self.alpha = alpha
self.batchSize = batchSize
for i in np.arange(0, len(layers) - 2):
self.W.append(np.random.randn(layers[i] + 1, layers[i + 1] + 1))
self.W.append(np.random.randn(layers[-2] + 1, layers[-1]))
def sigmoid(self, x):
return 1.0 / (1 + np.exp(-x))
def sigmoidDerivative(self, z):
return z * (1 - z)
def getNextBatch(self, X, y, batchSize):
for i in np.arange(0, X.shape[0], batchSize):
yield (X[i:i + batchSize], y[i:i + batchSize])
def fit(self, X, y, epochs = 10000, update = 1000):
X = np.hstack((X, np.ones([X.shape[0],1])))
for epoch in np.arange(0,epochs):
p = np.arange(0,X.shape[0])
np.random.shuffle(p)
X = X[p]
y = y[p]
for (x, target) in self.getNextBatch(X, y, self.batchSize):
A = [np.atleast_2d(x)]
for layer in np.arange(0, len(self.W)):
net = A[layer].dot(self.W[layer])
out = self.sigmoid(net)
A.append(out)
error = A[-1] - target
D = [error * self.sigmoidDerivative(A[-1])]
for layer in np.arange(len(A) - 2, 0, -1):
delta = D[-1].dot(self.W[layer].T)
delta = delta * self.sigmoidDerivative(A[layer])
D.append(delta)
D = D[::-1]
for layer in np.arange(0, len(self.W)):
self.W[layer] -= self.alpha * A[layer].T.dot(D[layer])
if (epoch + 1) % update == 0:
loss = self.computeLoss(X,y)
print('Epoch =', epoch + 1, '\t loss =', loss)
def predict(self, X, addOnes = True):
p = np.atleast_2d(X)
if addOnes:
p = np.hstack((p, np.ones([X.shape[0],1])))
for layer in np.arange(0, len(self.W)):
p = self.sigmoid(np.dot(p, self.W[layer]))
return p
def computeLoss(self, X, y):
y = np.atleast_2d(y)
predictions = self.predict(X, addOnes = False)
loss = np.sum((predictions - y)**2) / 2.0
return loss
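One detail worth flagging in this class: sigmoidDerivative takes the activation value itself, not the pre-activation, so z * (1 - z) is the right form only because A[layer] already holds sigmoid outputs during backpropagation. A small sanity check of that convention (illustrative only, not a cell from the notebook):
# Illustrative check: sigma'(x) = sigma(x) * (1 - sigma(x)), evaluated two ways.
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

x = 0.5
s = sigmoid(x)
analytic = s * (1 - s)                                    # what sigmoidDerivative(s) computes
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6  # central difference
print(analytic, numeric)                                  # both are roughly 0.235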
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([devX.shape[1],32,16,testY.shape[1]], 0.005, 16)
model.fit(devX, devY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(devX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
print(classification_report(devY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19)
Outputs shape is (4619, 5)
Epoch = 10 loss = 1403.2865075239638
Epoch = 20 loss = 1295.7698053970717
Epoch = 30 loss = 1192.7432920298759
Epoch = 40 loss = 1078.846815204404
Epoch = 50 loss = 963.7294660024506
Epoch = 60 loss = 857.6145384685033
Epoch = 70 loss = 762.6575790024502
Epoch = 80 loss = 672.1129920592439
Epoch = 90 loss = 593.2913987115348
Epoch = 100 loss = 527.9205566488499
Training set accuracy
precision recall f1-score support
0 0.99 0.62 0.77 202
1 0.86 0.76 0.80 834
2 0.88 0.97 0.92 2045
3 0.90 0.96 0.93 1424
4 0.00 0.00 0.00 115
accuracy 0.89 4620
macro avg 0.73 0.66 0.68 4620
weighted avg 0.87 0.89 0.87 4620
Test set accuracy
precision recall f1-score support
0 0.99 0.55 0.71 216
1 0.83 0.78 0.81 808
2 0.89 0.97 0.93 2024
3 0.91 0.96 0.94 1473
4 0.00 0.00 0.00 98
accuracy 0.89 4619
macro avg 0.73 0.65 0.68 4619
weighted avg 0.88 0.89 0.88 4619
This is speculation, but the better performance with sigmoid may come from the fact that the sigmoid activation is based on $e$ while many of the inputs are logarithmic.
Finally, wrapping things up and training this last setup on the full training set, we get:
(trainX, restX, trainY, restY) = train_test_split(inputs, outputs, test_size = 0.20, random_state = 1)
(testX, devX, testY, devY) = train_test_split(restX, restY, test_size = 0.50)
trainY = to_categorical(trainY, 5)
devY = to_categorical(devY, 5)
testY = to_categorical(testY, 5)
print("Inputs shape is ",inputs.shape )
print("Outputs shape is ",testY.shape)
model = FeedforwardNeuralNetworkSGD([trainX.shape[1],32,16,trainY.shape[1]], 0.005, 16)
model.fit(trainX, trainY, 100, 10)
print("Training set accuracy")
predictedY = model.predict(trainX)
predictedY = predictedY.argmax(axis=1)
devY = devY.argmax(axis=1)
trainY = trainY.argmax(axis=1)
print(classification_report(trainY, predictedY))
print("Test set accuracy")
predictedY = model.predict(testX)
predictedY = predictedY.argmax(axis=1)
testY = testY.argmax(axis=1)
print(classification_report(testY, predictedY))
Inputs shape is (46194, 19)
Outputs shape is (4619, 5)
Epoch = 10 loss = 4831.3394595768095
Epoch = 20 loss = 2637.7142878897885
Epoch = 30 loss = 2120.8178476481594
Epoch = 40 loss = 1535.1877282946155
Epoch = 50 loss = 1315.9008743506008
Epoch = 60 loss = 1284.4725078167007
Epoch = 70 loss = 1331.1564977827568
Epoch = 80 loss = 1055.6417006823713
Epoch = 90 loss = 934.3459501095784
Epoch = 100 loss = 973.1230600022394
Training set accuracy
precision recall f1-score support
0 1.00 0.97 0.98 1662
1 0.99 0.97 0.98 6878
2 0.99 0.98 0.98 16159
3 0.91 1.00 0.95 11482
4 0.00 0.00 0.00 774
accuracy 0.96 36955
macro avg 0.78 0.78 0.78 36955
weighted avg 0.95 0.96 0.95 36955
Test set accuracy
precision recall f1-score support
0 0.99 0.96 0.98 188
1 0.99 0.96 0.98 836
2 0.98 0.97 0.98 2004
3 0.90 1.00 0.95 1484
4 0.00 0.00 0.00 107
accuracy 0.96 4619
macro avg 0.77 0.78 0.78 4619
weighted avg 0.94 0.96 0.95 4619
Thus, we get pretty good results in the end. The pattern is probably due to the neural network figuring out that by multiplying the views-per-subscriber ratio by the number of subscribers (or rather, by adding them, since the features are logarithms), it can obtain a sort of "baseline" number of views, given that most subscribers, especially for big channels, will watch the video. So if a channel has about 1M subscribers and a views-per-subscriber ratio of about 1.5, the network can predict that the video will easily get around 1.5M views as a baseline (maybe more, maybe less). Of course, this is only speculation.
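To make that arithmetic concrete (assuming base-10 logs purely for illustration; the additivity holds for any base), the baseline in log space is just a sum:
# Illustrative arithmetic only: with log-scaled features, the "baseline" view
# count is recovered by adding the logs and exponentiating.
import numpy as np

log_subscribers = np.log10(1_000_000)    # 1M subscribers -> 6.0
log_ratio = np.log10(1.5)                # 1.5 views per subscriber -> ~0.176
baseline_views = 10 ** (log_subscribers + log_ratio)
print(f"{baseline_views:,.0f}")          # about 1,500,000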
The best results were obtained with the sigmoid activation, a learning rate of 0.005, and a batch size of 16, trained on the full training set: roughly 96% accuracy on both the training and test sets, although class 4 (by far the smallest class) was never predicted.
As I mentioned before, the focus of this project was data collection, which is why I would like to build on top of the effort realized here in order to make more accurate predictions using more important features such as thumbnails, title content, etc., using the more sophisticated neural network types talked about later in the course.
One key note to consider about my approach to views is that it is rather naive in that it does not take into account the role of the YouTube algorithm. While videos with interesting or eye-catching thumbnails, titles, and channels will generally be more likely to get views, there is still an element of spontaneity introduced by the YouTube algorithm. This is especially true when one considers the changes that have been made to it over the years in order to satisfy different objectives such as:
I would like to give credit to Corey Schafer for his helpful guide and code sample on how to use the YouTube Data API in Python (specifically for generating the access code).
Picture of the computers used at Evans Hall to facilitate the collection process.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Load and display the photo of the collection setup at Evans Hall.
image = mpimg.imread("images/collection.jpeg")
plt.figure(figsize=(200,100))
plt.imshow(image)
plt.show()